Sequence-dependent thermodynamics of a coarse-grained DNA model 
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We introduce a sequence-dependent parametrization for a coarse-grained DNA model [T. E. Ouldridge, A. A. 
Louis, and J. P. K. Doye, J. Chem. Phys. 134, 085101 (2011)] originally designed to reproduce the properties 
of DNA molecules with average sequences. The new parametrization introduces sequence-dependent stacking 
and base-pairing interaction strengths chosen to reproduce the melting temperatures of short duplexes. By 
developing a histogram re weighting technique, we are able to fit our parameters to the melting temperatures 
of thousands of sequences. To demonstrate the flexibility of the model, we study the effects of sequence 
on: (a) the heterogeneous stacking transition of single strands, (b) the tendency of a duplex to fray at its 
melting point, (c) the effects of stacking strength in the loop on the melting temperature of hairpins, (d) the 
force-extension properties of single strands and (e) the structure of a kissing-loop complex. Where possible 
we compare our results with experimental data and find a good agreement. A simulation code called oxDNA, 
implementing our model, is available as free software. 



I. INTRODUCTION 

Living organisms store genetic information in DNA, a 
double-stranded polymer composed of a sugar-phosphate 
backbone with four different kinds of bases (adenine A, 
thymine T, cytosine C or guanine G) attached. The 
bases have highly anisotropic mutual interactions that 
are responsible for the formation of non-trivial struc- 
tures, such as helical double strands, primarily through 
hydrogen bonding and stacking interactions. To a first 
approximation, base-pairing occurs between complemen- 
tary base pairs (A-T and G-C). 1 Given the reliability and 
programmability of base-pair formation, DNA is an ob- 
vious candidate for use in self-assembly. Indeed, DNA 
has been exploited as a building block for the assembly 
of nanostructures and active devices: successes include 
DNA computation^ motors hierarchical self-assembly 
of tiles^ and self-assembly of strands into large structures 
such as DNA origamis. 6 

Many theoretical and computational approaches have 
been developed to study DNA. At the most fine-grained 
level, quantum chemistry calculations can be used to 
study the interactions between nucleotides.^^ While 
they provide valuable information about the ground state 
energies at a high level of detail, they are computation- 
ally demanding and do not allow for the study of dy- 
namical processes involving breaking and forming of base 
pairs. Molecular simulation packages such as AMBERp^ 
or CHARMMj^that retain an all- atom representation of 
the nucleic acids but use empirical force fields to model 
their interactions, are extensively used for computational 
studies of both DNA and RNA as well as their inter- 
actions with proteins.^ Although faster than quantum 
chemistry methods, they still are computationally very 
demanding and the time scales they can currently ac- 
cess are of the order of the /is, while many biologically 



and technologically relevant processes happen at the ms 
timescale or longer. At the moment, simulations of rare 
events such as the breaking of base pairs remain at the 
limit of what is possible. At the next level of complexity 
are coarser models of DNA^^Hlthat integrate out several 
degrees of freedom, such as replacing a group of atoms by 
a single site with effective interactions. While these mod- 
els cannot describe the system at the same level of detail 
as atomistic simulations, they allow one to study much 
larger systems and address rare events. Finally, continu- 
ous models of DNA^^ completely neglect the detailed 
chemical structure but allow for analytical treatment in 
the thermodynamic limit, and have been used to study 
macroscopic properties such as melting temperatures or 
properties of DNA under stressf^EU 

DNA nanotechnology exploits processes that include 
strand diffusion and breaking and forming of base 
pairs. Computational methods describing such systems 
must be efficient enough to access the time scales at 
which these processes happen. Moreover, the coarse- 
grained model must be properly designed to capture 
the structural, thermodynamical and mechanical prop- 
erties of DNA in both the single- and double-stranded 
forms. Such a coarse-graining approach was recently 
used to develop the nucleotide-level model of Ouldridge 
et al. ]24tt36] th a t wa s subsequently successfully applied to 
the study of DNA nano tweezers,^ kissing hairpins 
DNA walkers, 36 the nematic transition of dense solu- 
tions of short duplexes,^ and the formation of DNA 
cruciformsP^ The model was designed with an "average- 
base" representation that includes specificity of base- 
pairing but otherwise neglects the dependence of inter- 
actions on sequence. Consequently, the model is suited 
to study processes for which sequence heterogeneity is of 
secondary importance. 

Nevertheless, many biological processes and technolog- 
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ical applications of nucleic acids rely on sequence hetero- 
geneity. It is well-known that A-T and G-C pairs have 
different relative binding strength, 1 with the latter be- 
ing significantly stronger because of the presence of three 
rather than two interbase hydrogen bonds. Moreover, the 
stacking interactions that drive the coplanar alignment 
of neighboring bases are known to show significantly dif- 
ferent behavior depending on sequence. 1 Furthermore, a 
strand of DNA possesses directionality, e.g. the phos- 
phates of the backbone connect to the 3' and 5' carbon 
atoms in the sugars. Interactions within a strand are 
therefore distinct when the bases are permuted: for ex- 
ample, the interaction of neighboring AT bases depends 
on whether the A is in the 5' direction with respect to 
the T or vice versa. Besides thermodynamic properties, it 
has been observed that mechanical and structural prop- 
erties such as flexibility, helical twist and even helix type 
are also influenced by the sequence P^lffi] 

To highlight the effects of sequence on the thermody- 
namics of DNA, we point out that the melting temper- 
ature of two oligomers with the same length but differ- 
ent sequences can vary by more than 50 °C, as shown 
in Fig. Tla) where we compare the melting temperatures 
of poly(dA), poly(dG), poly(dCdG) and poly(dAdT)P 
sequences of various lengths at an equal strand concen- 
tration of 3.36 x 10 -4 M. These melting temperature 
differences are only marginally diminished with increas- 
ing length and are exploited in vivo, where, for example, 
it has been observed that initiation sites of transcription 
are often composed of a higher than average number of 
A-T pairsP 

Note that beside the number of A-T and G-C base 
pairs the actual order of nucleotides in the sequence is 
also important: two sequences of the same length and the 
same number of A-T and G-C base pairs can still have 
melting temperatures that differ by more than 10 °C, as 
shown in Fig. [ljb) . 

Given these large variations, it is important to have a 
model that captures at least the thermodynamic effects 
of sequence. We note that some of the other coarse- 
grained models of DNA that have been developed do in- 
clude sequence effects in various level of detail, includ- 
ing sequence dependent base-pairing interactions^ and 
also sequence-dependent stacking^ and cross-stacking 
interactions.^ An extension^ of the model in Ref. [16] 
also has base pair deformability parametrized to the 
values determined by analysis of DNA-protein crystal 
complexes In contrast to these models, the model pre- 
sented in Refs. 134} l35l and l36l was specifically developed 
for applications in DNA nanotechnology and was primar- 
ily designed to represent the single- to double-stranded 
transition in a sufficiently physical manner. The aim of 
this work is to introduce a parametrization of this model 
that captures the sequence-dependence of DNA thermo- 
dynamics and use it to study sequence effects on simple 
test systems. 

We first present the original coarse-grained DNA 
model of Ouldridge et a/P^36] anc [ then describe the fit- 




10 15 
strand length 



10 15 
strand length 



FIG. 1. (a) Melting temperatures versus duplex length 
as predicted by SantaLucia's nearest neighbor modeP^ for 
a duplex consisting of poly(dA), poly(dAdT), poly(dC) or 
poly(dCdG) and an average sequence, (b) Maximum (circles) 
and average (squares) difference in melting temperature for 
strands with nucleotide positions randomly permuted. The 
terminal base pairs are kept the same, thus neutralizing dif- 
ferent end effects. Data were generated by selecting 50000 
random sequences at each length and permuting each 5000 
times. The differences show the importance of the order of 
the nucleotides in the sequence. 



ting procedure we developed for the sequence-dependent 
interactions. We test the parametrization on melting of 
duplexes and hairpins, the latter being a case to which 
the model was not fitted. We then explore the flexibility 
of the model by studying: (a) the heterogeneous stacking 
transition of single strands, (b) the tendency of a duplex 
to fray at its melting point, (c) the effect of stacking 
strength in the loop on the melting temperature of hair- 
pins, (d) the force-extension properties of single strands 
and (e) the structure of a kissing-loop complex. 



II. AVERAGE BASE COARSE-GRAINED DNA MODEL 

The coarse-grained DNA model, on which this work 
is based, is described in detail in Refs. [35] and [36j It 
represents DNA as a string of nucleotides, where each 
nucleotide (sugar, phosphate and base group) is a rigid 
body with interaction sites for backbone, stacking and 
hydrogen-bonding interactions. The potential energy of 
the system is 



V0 = Y1 (^.b. + ^stack + Kxc) 



(ij) 

+ O^HB+Kr.st. +Kxc + Kx.st.), (1) 

where the first sum is taken over all nucleotides that are 
nearest neighbors on the same strand and the second sum 
comprises all remaining pairs. The interactions between 
nucleotides are schematically shown in the Fig. [2j and 



the explicit forms can be found in Refs. l35l and l36l The 
hydrogen bonding (Vhb), cross stacking (Kr.st.) 5 coaxial 
stacking (Kx.st.) an d stacking interactions (V s tack) ex- 
plicitly depend on the relative orientations of the nu- 
cleotides as well as on the distance between interaction 
sites. The backbone potential Vb.b. is an isotropic spring 
that imposes a finite maximum distance between neigh- 
bors, mimicking the covalent bonds along the strand. 
The coaxial stacking term, not shown in the Fig. [2j is 
designed to capture stacking interactions between non- 
neighboring bases, usually on different strands. All in- 
teraction sites also have isotropic excluded volume inter- 
actions V exc or F exc . 

The coarse-grained DNA model of Refs . 1351 and l36l was 
derived in a "top-down" fashion, i.e. by choosing a phys- 
ically motivated functional form, and then focusing on 
correctly reproducing the free energy differences between 
different states of the system, as opposed to a "bottom- 
up" approach that starts from a more detailed repre- 
sentation of DNA and typically focuses on accurate rep- 
resentation of local structural details. The interactions 
were originally fitted to reproduce melting temperatures 
of 'average' oligonucleotides, obtained by averaging over 
the parameters of SantaLucia's model.E^In addition, the 
model is fitted to reproduce the structural and mechan- 
ical properties of double- and single-stranded DNA such 
as the persistence length and the twist-modulus. The 
model allows for base pairing only between Watson- Crick 
complementary bases, but otherwise does not distinguish 
between bases in terms of interaction strengths. 

The model was fitted to reproduce DNA behavior at 
a salt concentration ([Na + ] = 0.5 M) where the electro- 
static properties are strongly screened, and it may be rea- 
sonable to incorporate them into a short-ranged excluded 
volume. Such high salt concentrations are typically used 
in DNA nanotechnology applications, hence motivating 
this approach. It should be noted that the model neglects 
several features of the DNA structure and interactions 
due to the high level of coarse-graining. Specifically, the 
double helix in the model is symmetrical rather than the 
grooves between the backbone sites along the helix hav- 
ing different sizes, and all four nucleotides have the same 
structure. 

The main purpose of this paper is to go beyond the av- 
erage sequence parametrization by introducing sequence- 
dependent interaction strengths into the model. 



III. PARAMETRIZATION OF SEQUENCE-DEPENDENT 
INTERACTIONS 

We choose to perform a thermodynamic parametriza- 
tion of the sequence-dependent interactions, aiming to 
reproduce melting temperatures of short DNA duplexes. 
We seek the parameters that best reproduce the melt- 
ing temperatures as predicted by SantaLucia's model, 47 
which we treat as an accurate fit to experimental data on 
the melting of duplexes of different length and sequence. 
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FIG. 2. The figure shows schematically the interactions be- 
tween nucleotides in the coarse-grained DNA model for two 
strands in a duplex. All nucleotides also interact with a re- 
pulsive excluded volume interactions. The coaxial stacking 
interaction is not shown. 

We restrict sequence dependence to the strength of the 
base pairing and stacking interaction terms, keeping all 
other parameters fixed to the values of the original fit. 

A. SantaLucia's nearest-neighbor model 

In an important series of papers, SantaLuciaS33SI sum- 
marized the results of multiple melting temperatures of 
DNA oligomers, and also presented a nearest-neighbor 
model that reproduces the results of melting experiments 
(hereafter referred to as the SL model). This popular 
model is the basis of a number of widely used oligomer 
secondary structure and melting temperature prediction 
tools.tM^The moc [el assumes that DNA can exist in two 
states, either single-stranded or in duplex form, and gives 
a standard free-energy change of formation AG(T) of the 
duplex with respect to the single strands as a function of 
temperature. The expected yields of duplexes can then 
be calculated as a function of temperature through the 
relation: 

^=exp(-AG(T)/i?r), (2) 

where [A] and [B] are molar concentrations of single 
strands, [AB] is the molar concentration of the duplex 
and R is the molar gas constant. This result assumes 
the system is dilute enough to behave ideally apart from 
associations, a condition fulfilled in the vast majority of 
experiments. 

The SL model assumes that AG(T) is a sum of contri- 
butions, one for each base-pair step formed in a duplex 
with respect to the single-stranded state, along with cor- 
rections for end effects. A base-pair step consists of four 
bases; for example, the base-pair step GT/AC stands for 
a section of duplex that has GT bases on one strand and 
AC on the complementary strand. The SL model has 
10 unique base-pair nucleotide steps: AA/TT, AT/ AT, 
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TA/TA, GC/GC, CG/CG, GG/CC, GA/TC, AG/CT, 
TG/CA, GT/AC, where pairs are given in 3' -5' order 
along the strands. 

The contribution to AG(T) of each term is divided into 
a temperature-independent enthalpy and entropy, so that 
the overall form of AG(T) is given by 



AG(T) = AH- TAS, 



(3) 



with AH and AS being the (temperature-independent) 
sum of the individual contributions to the enthalpy and 
entropy respectively. The SL model is a two-state model, 
in that it considers two regions of state space (the du- 
plex and single-stranded states) and assumes that there 
is a constant enthalpy and entropy difference between the 
two. In other words, it neglects the variation in enthalpy 
within the bound and unbound sub-ensembles. 

The melting temperature T m for a given sequence is de- 
fined in the SL model as the temperature at which half 
of the strands in the system are in the duplex state and 
the other half are in the denatured state. Using this def- 
inition, the SL model has an average absolute deviation 
of 1.6 °C when compared to known experimental melt- 
ing temperatures of 246 duplexes with lengths between 4 
and 16 base pairs™ We fit to the T m as predicted by the 
SL model, rather than having to re-analyse the original 
experimental data. This choice allows us to fit to a large 
ensemble of different sequences whose melting tempera- 
tures we estimate using the SL model. 

We emphasize that, in contrast to the SL model, our 
model itself does not exhibit ideal two-state behavior. 
Although we observe a large difference in the typical en- 
ergies of single-stranded and duplex states, allowing us 
to clearly differentiate the two, we also observe signifi- 
cant variation within these sub-ensembles. Both single- 
stranded and duplex states have multiple microscopic de- 
grees of freedom, which respond differently to changes 
in temperature. For instance, we observe fraying of du- 
plexes (Sec. VB) and that the single strands undergo a 



stacking transition (Sec. VA). The net effect is that the 



AH and AS of transitions that would be inferred from 
our model are not temperature independent, unlike in 
the SL model. 

We note that other models for the prediction of DNA 
melting temperatures exist, such as the recently devel- 
oped nearest-neighbor model of Ref. |53j which uses the 
mechanical unzipping of DNA hairpins to infer the indi- 
vidual base pair step free energies. Our parametrization 
procedure only requires estimates of the melting temper- 
ature for a large set of DNA sequences and could be also 
used to fit our model to the melting temperature predic- 
tions of Ref. [53l 



B. Fitting of the parameters 

Our model was originally parametrized to reproduce 
the melting temperatures of average sequences as pre- 
dicted by the SL model. Since the SL model is con- 



structed on the level of base-pair steps, it cannot be used 
to differentiate between intrastrand interactions within 
a step: for example, AA and TT or AG and CT. We 
therefore set the stacking interaction strengths of bases 
that belong to the same base-pair step to be equal in our 
parametrization procedure. 

To parametrize our coarse-grained DNA model's po- 
tential Vq (Eq.[T]), we scale the ^tack and Vhb interaction 
terms by the factors and rjij respectively, i.e. 



Vr.b. -> otijVa.B. 

^stack ^ ^ij^stack? 



(4) 
(5) 



where and rjij are constants for a given nucleotide pair 
ij. There are therefore 10 parameters rjij (as shown in 
Table |l| and two parameters acG an d at to fit. Making 
the cross-stacking interaction sequence-dependent would 
also influence melting temperatures, but as we will dis- 
cuss later, sequence-dependent stacking and base-pairing 
interactions provide enough parameters to obtain results 
in almost complete agreement with the predictions given 
by the SL model. To fit the 12 coefficients rjij and o^-, 
we used a set S of oligonucleotides of lengths 6, 8, 10, 12 
and 18 for which we calculated the (salt-adjusted) melt- 
ing temperatures using the SL model. The set contained 
2000 randomly generated sequences for each of lengths 8, 
10, 12, 18 and all 4160 sequences of length 6. The set was 
then reduced to contain only heterodimers, leaving 12 022 
sequences in total. We chose to remove homodimers (self- 
complementary sequences) for convenience, because the 
inference of the bulk melting temperatures from simula- 
tions of the formation of a single duplex is different from 
that for heterodimers, as discussed in Ref. [54j 

We select the parameter set that minimizes the func- 
tion: 



(6) 



ses 



where Z^(SL) is the melting temperature of the oligonu- 
cleotide s in the set *S as predicted by the SL model and 
( a ij ? Vij ) i s the melting temperature predicted by our 
model with sequence-dependent base pairing and stack- 
ing potentials o^Vh.b. and 77^ T^tack • To accurately fit 
OLij and rjij , we hence need estimates of the melting tem- 
peratures of many different sequences for many different 
values of the interaction parameters. 

If one simulates a system consisting of two complemen- 
tary strands in the simulation box at exactly the melting 
temperature then the ratio of observed duplex states to 
single-stranded states 



A^duple 



single 



(7) 



should be equal to 2 for heterodimers and 1 for homod- 
imers. The value of 2 for heterodimers is a correction for 
finite size effects that arise when one simulates only two 
strands instead of a bulk ensemble at the same average 
concentration. 54 The correction assumes that the density 
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of strands is low enough that they behave ideally apart 
from association. 

To calculate melting temperatures for the large set 
of sequences S we employed a histogram reweighting 
method Ia5|56j g ene rated once, for each duplex length 
considered, a set of 5000 single-stranded and 10 000 du- 
plex configurations C S i ng i e and Cdupiex- The configura- 
tions in C S ingie and Cdupiex were sampled from the Boltz- 
mann distribution of strands of sequence Sq at the melt- 
ing temperature To using the average parametrization 
(i.e., C6ij — 1 and rjij = 1). Simulations were performed 
in a cell that gave a concentration of 3.36 x 10 -4 M for 
each strand. Twice as many duplex as single-stranded 
states were sampled because they appear in exactly this 
ratio in a simulation of two strands at the melting tem- 
perature of a given sequence in the average model (To). 
Sampling was done at sufficiently large intervals that the 
configurations in C S i ng i e and Cdupiex were uncorrelated. 

In order to find the ratio & S (T, otij,r]ij) for a sequence 
s at temperature T with a parameter set ol^ and rjij that 
corresponds to a potential V (a^- , rjij , T) , states in C s i ng ie 
and Cdupiex were reweighted by the factor 



VZ' ao (T ) ( m ,;.//;,•. T) 

k B T k B T 



wi,s (T,aij,r)ij) = exp 



(8) 

where Vq S °(Tq) is the energy of the Z-th state generated 
at temperature Tq using the sequence sq in the average 
model, and V l,s (a,rj,T) is the sequence-dependent po- 
tential evaluated on the same Z-th state for the sequence 
s. Note that both interaction potentials are a function of 
temperature because the stacking interaction term in the 
model is temperature dependent! * 36 * The configurations 
used in Eq. [8] are generated at T with Vq and sq , but 
each is counted with a weight that corresponds to the 
desired set of new parameters. 

The ratio of the duplex to single-stranded states for a 
given temperature T and parameters ol^ , rjij becomes 



<$> s {T,aij,rjij) = — 2 — 

Z^/cGCsingie W k,S (J- j ®ij,Vij) 



(9) 



where the index Z runs through all generated duplex 
states while k runs through all generated single stranded 
states. Using this method, 3> S (T, o^-, rjij) can be gener- 
ated for a set of temperatures and interpolated in order 
to find T such that 3> S (T, o^-, 77^) = 2, which is by defi- 
nition the melting temperature T m of a given duplex. 

The histogram reweighting method assumes that the 
ensemble of configurations generated at temperature To 
with potential Vb for sequence sq is also representative of 
the state space of the system at temperature T and po- 
tential V(a, rj, T) for sequence s. To check whether we in- 
cluded enough states, we compared the estimation of the 
melting temperature by histogram reweighting of 15 000 
states to an estimation which only used 6000 different 
states. For a test case of 71 000 sequences of oligonu- 
cleotide lengths 8, 12 and 18, the mean absolute deviation 



of the difference between the predicted T m was smaller 
than 0.1 °C, suggesting that the choice of 15 000 states 
provides a large enough ensemble for estimating the melt- 
ing temperatures, at least on average. 

To find a set of parameters that minimizes the func- 
tion in Eq. [5J we ran a simulated annealing algorithm.^ 
We first fitted the base-pairing strengths acG and «at 
while holding the stacking parameters constant. Then 
we fitted the 10 stacking parameters rjij in a second step. 
The separate fitting of the two sets of parameters sim- 
plifies the fitting procedure, as the converged values for 
OLij provide an initial point for the stacking parameters 
fitting. It also allows us to compare the performance of a 
model where only the base-pair interaction strengths are 
sequence-dependent to the one where both base-pairing 
and stacking interactions are sequence-dependent. 

We note that our fitting procedure requires the ability 
to efficiently estimate melting temperatures. The his- 
togram reweighting method, using the generated states, 
takes only about Is to calculate the melting temperature 
of a given sequence. This is a huge reduction in computer 
time as compared to umbrella sampling simulations 
which were used in the parametrization of the original 
average sequence modelP^ The umbrella sampling simu- 
lation samples multiple single- to double-stranded transi- 
tions for a given oligomer and requires around two weeks 
of CPU time to calculate the melting temperature to 
within 0.3° C accuracy for the sequence lengths that we 
considered for our parametrization. Thus our histogram 
re-weighting methodology provides the crucial speed-up 
that made the parametrization possible. 



C. Parametrization results 

While the parameters acG and «at were fairly robust 
to details of the optimization procedure, the parameters 
rjij were more sensitive. In order to uniquely determine 
these parameters we selected the set with the smallest 
average error on an additional test set of 95 958 sequences 
that included all sequences of lengths 5, 6, 7 and 8 for 
which the SL model predicts a T m greater than 0°C for 
the concentration 3.36 x 10 _4 M, plus a set of randomly 
generated sequences of lengths 10, 12 and 18. The final 
set of parameters rjij and a^- , as introduced in Equations 
Q and (p)]), is shown in Table [l[ 

Figure [3] compares a histogram of the difference 



AT m = T m (aij,rjij) - T m (SL) 



(10) 



between the melting temperatures T m (a^-, 77^), calcu- 
lated by our coarse-grained model (using histogram 
reweighting) and the T m (SL) of the SL model, deter- 
mined for each of the 95 958 sequences in our test 
set. The dashed curve shows our model's performance 
when only the base pairing interactions are sequence- 
dependent (parameters q^cg an d a at from Table [i]) and 
the stacking parameters rjij are all set to unity. The solid 
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Base pairing 


OLij 


AT 


0.8292 


GC 


1.1541 


Stacking 


Vij 


GC 


1.027 


CG 


1.059 


AT 


0.947 


TA 


0.996 


GG, CC 


0.978 


GA, TC 


0.970 


AG, CT 


0.982 


TG, CA 


1.009 


GT, AC 


1.019 


AA, TT 


1.042 



TABLE I. Summary of the final parameters that were fitted to 
reproduce melting temperatures of randomly chosen oligonu- 
cleotides as predicted by the SL model. Base steps are in 3'-5' 
direction. 
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FIG. 3. The histogram shows the performance of the fitted 
DNA coarse-grained model for the set of 95 958 test sequences. 
AT m is the difference in the melting temperature predicted 
by the coarse-grained model and by the SL model. The blue 
dashed curve corresponds to a model where only hydrogen- 
bonding interactions were parametrized and the red curve cor- 
responds to the model where the stacking interactions are also 
sequence-dependent (using values from Table 



curve shows the histogram when the melting tempera- 
tures are calculated by our model with both hydrogen 
bonding and stacking sequence-dependent parameters. 
The standard deviation of the distribution of AT m with 
sequence-dependent base-pairing and average stacking is 
2°C, while the standard deviation for the case where 
stacking is also sequence-dependent is 0.85 °C. This com- 
pares to a standard deviation of 8.6 °C for the original 
average-base model. We note that although the average 
deviation is very small, there are a number of melting 
temperatures in our set that differ significantly more than 
one would expect from a Gaussian distribution with this 
standard deviation. These outliers are typically highly 
repetitive sequences. 

Since the SL model has an average absolute deviation 
of 1.6 °C when compared to experimental melting tem- 
peratures of 246 duplexes of lengths between 4 and 16, 
there is little point in trying to further improve our pre- 
dictions with respect to it. That it is possible to repro- 
duce the predictions of the SL model with our set of 12 
parameters also implies that it would not be appropriate 
to introduce sequence dependence for other terms in the 
interaction potential by fitting only to T m (SL). Instead, 
other physical input would be needed. 

It is also important to point out that, as discussed 
previously, by fitting to a model which considers only 
base pair steps it is not possible to distinguish between, 
for example, A A or TT stacking strengths, which are 
known to be different P^l Even though we treat stacking 
within base pair steps equally, our method in principle 
allows the stacking interaction for each individual stacked 
pair to be parametrized differently. But in order to do 
this fitting, new experimental data is needed. We further 
discuss the parametrization of stacking interactions in 
section IV CI 



IV. TESTS OF THE PARAMETRIZATION 

In this Section, we test the performance of our 
sequence-dependent parametrization by comparing the 
melting temperatures of selected duplexes, as well as for 
hairpins, to which the model was not directly fitted. 

We have also tested the structural and mechanical 
properties of double-stranded DNA (away from thermo- 
dynamic transitions) on a randomly generated sequence 
with around 50% GC-content and confirmed that they 
are not changed with respect to those of the original 
average-base parametrization. So our double-stranded 
persistence length remains approximately 125 base pairs, 
and the B-DNA structure produced by the model is the 
same as in Ref s . l35l and l36l On the other hand, the struc- 
tural and mechanical properties of single-stranded DNA 
properties do differ from those of the average model, and 
are studied in Sec. IV Al and IV Dl 



A. Duplex melting 

To further test our histogram reweighting method, we 
calculated several oligomer melting temperatures using 
umbrella sampling Monte Carlo simulations.^ While his- 
togram reweighting method estimates the melting tem- 
perature using the same 15 000 generated states for 
each duplex length considered and extrapolates from the 
average-base to the sequence-dependent potential, um- 
brella sampling simulations are run separately for each 
sequence considered. The umbrella sampling uses the 
sequence-dependent potential and is done close (within 
3 °C) to the melting temperature of given sequence, hence 
providing a more accurate estimation of the melting tem- 
peratures in our model. 
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Sequence T m (US) T 


m fHR) T 




(SL-av 


AAGCGT 


38.0 


38.2 


39.6 


31.2 


GAGATC 


24.4 


24.0 


22.0 


31.2 


TCTCCATG 


44.7 


44.6 


44.6 


48.2 


CCCGCCGC 


71.1 


( U.D 


( 1.1 


A Q 9 
4o.Z 


ATTTATTA 


21.2 


21.3 


23.9 


48.2 


ATATAGCTATAT 


47.0 


49.3 


48.1 


64.7 


ATGCAGCTGCCG 


74.0 


74.3 


72.6 


64.7 


GCGCAGCTGCCG 


79.8 


79.6 


79.0 


64.7 



Sequence 



T m T m (SL) 



TABLE II. Duplex melting temperatures (shown in °C) as 
predicted by our coarse-grained DNA model using umbrella 
sampling Monte Carlo simulations (T m (US)) and histogram 
reweighting (T m (HR)) compared to that for the SL model 
(T m (SL)). T m (SL-avg) is the melting temperature as pre- 
dicted by the averaged SL model, which depends only on the 
length of the sequence. Sequences are specified in 3'-5' direc- 
tion. 



The comparison between the different methods are 
shown in Table [II] for a series of sequences. On aver- 
age the histogram reweighting and the umbrella sampling 
agree to within 0.3 °C, which is very satisfactory. How- 
ever, there is one significant outlier, ATATAGCTATAT, 
for which a difference of 2.3 °C was obtained. One reason 
for the difference may be that the melting temperature is 
about 16.6 °C lower than the melting temperature of an 
average strand of the same length from which the config- 
urations were taken for the histogram reweighting. This 
difference is larger than the typical width of the melt- 
ing transition (around 10 °C for sequences of length 12). 
Moreover, the sequence has a relatively high A-T content 
and may adopt structures with significant fraying at the 
ends that contribute to the ensemble of configurations 
for the actual strand. However, such frayed states might 
have been poorly sampled when the ensemble was gen- 
erated using the average base model. For these reasons, 
the sampled configurations may not provide a good rep- 
resentation of the true state-space of the system. Nev- 
ertheless, a number of other sequences tested here also 
have melting temperatures that differ significantly from 
the average sequence, without exhibiting such a large dif- 
ference in the predicted melting temperatures between 
the two methods. Although it may be true that includ- 
ing a significantly larger set of states in the histogram 
reweighting method could reduce the errors in these out- 
liers, we decided not to pursue this route further, given 
that the accuracy of the underlying SL model is not much 
different than our parametrization errors. Should a sig- 
nificantly more accurate model of the experimental data 
become available, however, then it may be that this point 
needs to be revisited. 



AGCGTCACGC-(T) 6 -GCGTGACGCT 86.5 86.7 

AGTATCAATC-(T) 6 -GATTGATACT 62.2 64.4 

AGCGTC-(T)i -GACGCT 64.5 67.0 

AGTATC-(T)i -GATACT 44.0 47.3 

TABLE III. Hairpin melting temperatures (shown in °C) as 
predicted by our coarse-grained DNA model (T m ) compared 
to the prediction by the SL model T m (SL). Sequences are 
specified in 3'-5' direction. 



B. Hairpin melting temperatures 

We also tested our model's predictions for hairpin 
melting temperatures. This provides a distinct test of 
the parametrized model, since the sequence-dependent 
parameters were fitted to duplex melting temperatures 
only. Importantly, this test also probes the quality of the 
model's description of the single-stranded state, a feature 
often neglected in DNA models. We test melting temper- 
atures of 4 different hairpin-forming sequences with dif- 
ferent stem and loop lengths. We used strong and weak 
stem sequences to highlight sequence effects. 

The simulations were performed with umbrella sam- 
pling using the number of correct base pairs in the stems 
as a reaction coordinate. The melting temperature T m is 
defined as the temperature at which the system spends 
half of the time in the hairpin state, which is in turn de- 
fined as the ensemble of configurations with one or more 
correct base pairs, 
dictions for 



we compare our pre- 



In Table [TTT 
with those obtained from the SL model. 
The average-base parametrization was previously found 
to consistently underestimate T m for hairpins by approx- 
imately 3 °C, but to show the correct variation with loop 
and stem length.^ The sequence-dependent parametriza- 
tion presented here also tends to underestimate T m by 
roughly the same amount, but the sequence effects are 
well captured. 

We further examine the effect of stacking on the melt- 
ing temperature of hairpins with longer loops in sec- 
tion |V C[ where we compare our model with the exper- 
imentally measured influence of sequence content of the 
loop on the hairpin melting temperature, an observation 
which is beyond the SL model. 



V. SEQUENCE-DEPENDENT PHENOMENA 

To demonstrate some of the strengths and weaknesses 
of our new model, we present, in this section, a se- 
ries of studies of DNA systems for which sequence plays 
a non-trivial role. The results were obtained from ei- 
ther Monte Carlo or dynamical simulations of the model. 
The Monte Carlo algorithm used is a Virtual Move 
Monte Carlo algorithm 59 and the molecular dynamics 
simulations were performed using a Brownian dynamics 
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algorithm 60 with the thermostat as described in Ref. [61] 

A. Heterogeneous stacking transition of single strands 

Our model strands undergo a broad stacking transi- 
tion, i.e., a transition from a state with all or the majority 
of neighboring bases coplanarly aligned to a state with 
disrupted alignment, as a function of temperature ! 35 1 36 1 
Such a transition is also generally accepted to occur for 
DNA, although there is not a clear consensus in the lit- 
erature about many aspects of this transition.^ 

To investigate the sequence dependence of stacking 
in our model, we ran Brownian dynamics simulations 
for a 14-base single strand with sequence GCGTCAT- 
ACAGTGC (the same sequence as studied in Ref. [62]) at 
a range of temperatures. We measured the probability 
that a neighbor pair stacks. Two bases are considered 
to be stacked if the magnitude of their stacking inter- 
action energy is at least 6% of its maximal value. The 
choice of a cutoff is one of convenience; we have checked 
that doubling it does not measurably change the results. 
Even though the different stacking strengths do not vary 
from the average by more than 7%, the effects on the 
stacking probabilities are still quite significant. For ex- 
ample, as shown in Fig. Qa), the difference between the 
strongest (GC) and the weakest (AT) stacking pairs is 
large enough that the midpoints of the transitions are 
separated by about 40 °C. 

The structure of the single strands is also heteroge- 
neous, consisting of unstacked and stacked regions of var- 
ious lengths, as illustrated in Fig. Qb). The stacked re- 
gions adopt a helical geometry, whereas the unstacked 
regions are more disordered. 

The strands are also dynamically heterogeneous: over 
time the stacked and unstacked regions grow and shrink, 
while the average probability that a given neighboring 
pair of bases stack varies with temperature and position 
is measured in Figure [4ja) . Mechanical and structural 
properties of the single strands are therefore heteroge- 
neous both in space and in time. 

While we are confident that the existence of signifi- 
cant temporal and spatial heterogeneity in single strands 
is a robust qualitative prediction of our model, given 
the paucity of experimental and theoretical data on the 
detailed stacking interactions between individual bases, 
many questions about the nature and time scales of these 
heterogeneities remain open. 

B. Hybridization free energy profiles of duplexes 

For our average-base parametrization, we have previ- 
ously seen that duplexes at their melting point typically 
have a terminal pair of bases that are unbound. This be- 
havior is called fraying, and it is generally thought that 
the ease of fraying is sequence-dependent with A-T ends 
fraying more readilyP^ To explore the fraying behavior 
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FIG. 4. (a) The stacking probability, calculated as the frac- 
tion of time in the stacked state, varies with temperature and 
is heterogeneous along the sequence. Circles correspond to the 
strongest stacking term, CG (underscored with dotted line in 
sequence), while squares correspond to the weakest stacking 
step, AT (underscored with a dashed line in the sequence). 
Diamonds correspond to the average of all the stacking along 
the sequence, (b) A typical single stranded configuration at 
45 °C. The first two bases on the left are unstacked. The 
strand has three stacked regions which adopt a helical geom- 
etry. 



in our model, we study the free-energy profiles of the 
sequences ATATAGCTATAT, ATGCAGCTGCCG and 
GCGCAGCTGCCG. Note that all three sequences have 
the same four central bases but different ends. 

In Fig. [5] the free energies profiles are shown as a func- 
tion of the number of the native base pairs formed be- 
tween the complementary strands. The free energies were 
set to be equal to in the state with native base pairs, 
i.e. when the duplex is melted. 

Of most interest is how the most stable duplex state 
depends on sequence. For the strand with two G-C ends, 
the free-energy minimum is a state with all 12 bonds 
formed, although the free-energy cost of opening up 1 
base-pair is minimal. By contrast, for the case of ei- 
ther one or two A-T ends, the duplex has the lowest free 
energy in a state with 10 bonds formed. Although the 
system pays an energetic cost for having 2 bonds un- 
formed, it gains entropy from this opening up of the end 
base pairs. Thus, our model strands exhibit fraying, with 
the expected stronger tendency to fray for duplexes with 
weaker A-T ends. Note that the sequence with two A-T 
ends frays despite being at a significantly lower temper- 
ature than the G-C rich sequence. Fraying has many 
consequences for DNA behavior. For instance, it exposes 
the end bases, allowing them to take part in reactions 
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FIG. 5. Free energy profiles for three different duplexes of 
length 12 as a function of the number of complementary (na- 
tive) base pairs of the two strands. The simulations for each 
duplex were run at their respective melting temperatures, 
namely 48 °C, 73 °C and 80 °C. 



with other strands, which is important, for example, in 
a toehold-free displacement process!^ 

Other features of note that are apparent from the free 
energy profiles in Fig. [5] are the nature of the first free 
energy jump and the shape of the minimum correspond- 
ing to the bound state. The fact that the first jump is 
almost the same for all three sequences reflects that it 
is dominated by the loss of center of mass entropy on 
association, which is the same (in units of k^T) for the 
three systems. The shape of the free energy minimum 
corresponding to the duplex highlights differences in the 
ensemble of duplex states for different sequences. For the 
weakest sequence, at the melting point, the duplex can 
have as little as 7 base pairs for a significant fraction of 
the time, and roughly with the same probability as for 
it being fully closed. The most G-C rich sequence, on 
the other hand, shows little tendency to fray even at its 
melting point and it rarely breaks more than 3 base pairs. 



C. Loop sequence effect on hairpin melting temperatures 



In Section |IVB[ we tested our model on melting tem- 
peratures of hairpins with short loops of lengths 6 and 
10. In the SL model, the loop contribution to the free 
energy difference for closing a hairpin is considered to 
be of purely entropic origin and sequence independent. 
However, it was observed experimentally 65 that hairpins 
with the same loop lengths but different sequences have 
different melting temperatures. In particular, the ex- 
periment in Ref. [65] considers sequences with the same 
stem sequence and loops consisting of either poly(dA) 
or poly(dT). The observed difference in melting temper- 
ature of the two different loop sequences was 4°C for 
loop length 12 and increased to 12 °C for loop length 
30, with the poly (d A) loop always having lower melting 
temperature. It was proposed that the strand with a 



FIG. 6. Hairpin melting temperatures as predicted by our 
coarse-grained DNA model as a function of stacking strength 
within the loop. We use a sequence GGGTT-(X) 25 -AACCC, 
where X is taken to stack as A with other bases, and with 
stacking strength r/xx with itself. The sequence is specified in 
3 ; -5' direction. The predicted melting temperature for the SL 
model is 37.8° C. The inset shows stacking probability (P s t) 
within the loop region in the hairpin state (circles) and single- 
stranded case (squares) as a function of stacking strength ?7xx- 



poly (d A) loop region has a higher rigidity in the single- 
stranded case due to the base stacking and thus pays a 
larger penalty for closing. 

Although the experiments in Ref. l65l were done at a salt 
concentration of 0.1 M, lower than the 0.5 M to which our 
model was fitted, it is instructive to see in general how 
stacking in the loop influences the stability of hairpins. 
We calculated the melting temperature for the sequences 
with the same stem sequence as in the experiment and a 
range of stacking strengths in the loop. Since our model 
does not distinguish between AA and TT stacking, we 
use an artificial base type X that is taken to stack as A 
with other bases and distinctly (with stacking strength 
7]xx) with other bases of the same type X. 

The results, summarized in Fig. |6j show that for 
?7xx < 1, the melting temperatures are fairly insensitive 
to stacking strength whereas for rjxx ^ 1, the melting 
temperature starts to drop significantly with increasing 
stacking strength. In the inset of Fig. [6] we show the av- 
erage stacking probability in the loop, compared to that 
of the competing single-stranded state at the same tem- 
perature. In general, as the stacking strength increases, 
the probability that a piece of single-stranded DNA has 
long stacked regions also increases. The geometric con- 
straints of the loop on stacking therefore become more 
pronounced with increasing strength, destabilizing the 
hairpin and leading to a drop in the melting tempera- 
ture. On the other hand, for rj xx < 1, the stacked regions 
have an average length (I) < 3, which is short enough 
that the hairpin geometry does not significantly affect 
the stacking. 

If the data of Ref. [65] are to be interpreted using a 
model of stacking such as ours, we would infer that 
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poly(dA) has a very high stacking probability at these 
temperatures, while poly(dT) has a significantly lower 
one. But, as the inset of Fig. [6] shows, we would not 
conclude that poly(dT) is necessarily largely unstacked. 

It is interesting to note that the stacking strength 
where destabilization becomes noticeable coincides with 
the top end of our fitted strengths, and that if we were 
to separate poly(dT) and poly(dA) stacking strengths, it 
would not require an unreasonable change to give a signal 
of comparable size to that reported in Ref. |65j In par- 
ticular, if one sets t]aa to 1.105 and accordingly adjusts 
t]tt to 0.979 in order to keep the average of the two co- 
efficients the same as for our base-pair step parametriza- 
tion, the obtained difference in melting temperature of 
the hairpins with poly(dA) and poly(dT) loop is about 
9 °C. For these values of t]tt and t]aa the standard devi- 
ation of melting predictions for the set of duplexes used 
in testing our parametrization increases by only 0.1 °C. 
Thus, if one wants to investigate a system where the dif- 
ference in AA and TT stacking strengths plays an im- 
portant role, these coefficients can be used. However, in 
the absence of a systematic study of the effects of loop 
sequence on hairpin melting temperature at high salt, we 
do not include differences between pairs that cannot be 
distinguished by the SL model in our parametrization in 
Table |H 



D. Force-extension curves of single strands 

The mechanical properties of single strands have 
been exper imentally measured for both DNA and 
RNA^^HSHzni to characterize their average as well as 
base-specific properties. In particular, qualitatively dif- 
ferent behavior has been observed for single-stranded 
poly(dT) (poly(rU) in the case of RNA) compared to 
poly(dA) (poly(rC) or poly(rG) in the case of RNA); the 
latter exhibit significant deviations from standard poly- 
mer models such as freely- jointed and wormlike chains, 
whereas the former do not. These deviations — concave 
regions with negative curvature in the forc e-extension 
curves — are described as "plateaus" ! 58 | 66 | 6S | 

To investigate the effects of sequence on the mechanical 
properties of single strands in our model, we simulate me- 
chanical pulling and obtain force-extension curves for 50- 
base strands at room temperature (25 °C). We consider 
polymers corresponding to our weakest and strongest 
stacking sequences, poly(dGdA) and poly(dA), which dif- 
fer in rjij by about 7%. We note that in Sec. VC, 



(a) 40 



we 

used hairpin melting to distinguish AA and TT stack- 
ing strength, but the obtained values are open to enough 
uncertainty that in this section we return to our original 
parametrization. Our focus here is on the qualitative ef- 
fect of stacking differences, rather than their quantitative 
values. 

Fig.[7Ja) shows force-extension curves for our strongest 
and weakest stacking sequences. The concave section for 
strongly-stacked poly(dA) between 15 and 25 pN is qual- 
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FIG. 7. (a) Extension of 50- nucleotide single-stranded DNA 
at 25 °C as a function of applied force. In all panels, blue 
circles correspond to a poly(dA) sequence (strongest stacking 
in our model), while red squares correspond to a poly(dGdA) 
sequence (weakest stacking). The inset in (a) shows a mag- 
nified section of the force-extension curve for low forces, (b) 
Stacking probability of a neighbor pair as a function of the 
applied force F. (c) Average length of a stacked domain 
(I) as a function of applied force F. The open circles and 
crosses show (/) U ncoo P as predicted by the uncooperative stack- 



ing model (Eq. 11) using (P s t) as measured for poly (d A) and 
poly(dGdA) respectively, (d) Visualization of a 50-base long 
poly(dA) ssDNA under a tension F = 15 pN, showing mul- 
tiple stacked regions with helical geometry. The arrows in- 
dicate the applied force on the first and the last base, (e) 
Poly(dGdA) strand under a tension of 15 pN, consisting of 
short stacked regions as well as unstacked ones, (f) Magni- 
fied section of ssDNA illustrates that three stacked bases can 
align with the applied force without disrupting the stacking 
interaction. The contour length d z , aligned with the force, is 
larger than the axial rise d ax is- 
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itatively similar to the plateau-like features observed in 
experiment! 58 * 66 * 6 -^ The relatively weakly- stacked strand, 
poly(dAdG), follows a convex force-extension curve 
which is fairly typical of a classical homo-polymer model. 

The poly(dAdG) curve is similar to the one found 
for the average base model, which in turn is in reason- 
able quantitative agreement with experimental results 
for typical sequences. Although quantitative compari- 
son with experimental data for non-homopolymeric se- 
quences, such as A-phage ssDNAp^ is ha mpered by the 
presence of metastable secondary structure] 69 1 71 1 72 1 at ten- 
sions above about 15 pN, where hairpins are disrupted, 
the extension per base at given force in the average model 
is within 10% agreement with Ref.jTDJ A detailed discus- 
sion of the agreement between the average model and 
experiment is given in Ref. |36j 

To understand the difference between the two single 
strands in our simulations, it is instructive to first recall 
that the strands consist of dynamically changing stacked 
and unstacked regions, as discussed in section [VA} When 
no force is applied, an unstacked region typically has a 
shorter end-to-end distance than a stacked region because 
it is more flexible and hence behaves more like a random 
coil. On the other hand, unstacked regions also have a 
greater maximum extension because the backbone is not 
restricted to a helical geometry as in the case of stacked 
regions. 

To explore the effect of pulling the structure of the sin- 
gle strands, we measured the stacking probability (P st ) 
and the average length (I) of contiguously stacked sec- 
tions for both strands, where a section of length Z con- 
sists of Z + 1 bases. The results, as a function of ap- 
plied force, are plotted in Figs. [7^b) and (c). When no 
force is applied, the stonger-stacking strand poly(dA) has 
(Z) = 8 while the weaker-stacking strand poly(dAdG) 
consists mostly of short stacked regions with average 
length (Z) ^ 2. 

As shown in the inset of Fig. (7^a), at low forces 
the stronger- stacking poly (d A) strand is more extensi- 
ble than the weaker stacking one, by as much as 20% at 
1 pN force. The reason for this difference is that long 
stacked sections have a smaller entropic cost for aligning 
with the applied force than unstacked regions do. How- 
ever, as the force increases further and the strands align 
more with the force, the curves cross (at « 5pN), and 
poly(dA) becomes less extensible because of its shorter 
effective contour length. 

Increasing the force also leads to significant changes in 
the average length of stacked regions in poly(dA). Inter- 
estingly, at low force, the lower entropic cost for aligning 
of longer stacked strands leads to an initial increase in 
(P s t) and (I) with force (up to around 5pN). However, 
as the force increases further, both (P s t) and (Z) start 
to decrease because it becomes favorable for the strand 
to disrupt stacking to allow for greater extension. The 
reduction in stacking is particularly significant for the 
poly(dA) strand over the range 15 to 25 pN, the loca- 
tion of the concave region in the force-extension curve. 



The long stacked regions are broken down into shorter 
ones which facilitates an increase in the overall length of 
the polymer. However, a short region of 3 bases can still 
align its backbone with the force while remaining stacked, 
as illustrated in Fig. (7^f). Therefore, even though it is 
progressively reduced with force for both poly(dA) and 
poly(dGdA), a significant degree of stacking is preserved 
even at high forces. 

The changes in stacking hence explain the phys- 
ical cause of the concave "plateau" region in the 
force-extension curve for the stronger-stacking strand, 
poly(dA). It corresponds to the structural transition as 
the increasing force disrupts the long stacked regions and 
(I) decreases. The concave segment of the force-extension 
curve is not present for poly(dGdA) because the latter 
already consists of mostly short stacked regions at zero 
force. 

The differences in the structure of the poly(dA) 
and poly(dGdA) strands described above are further 
illustrated in Figs. [7^d) and (e), where snapshots of 
the sequences are shown for a force of 15 pN. The 
poly (d A) strands are clearly much more stacked than 
the poly(dGdA) strands are, and also more strongly 
aligned with the force. From this picture one can also 
see why the derivative of the force-extension curve begins 
to rise steeply for the poly(dA) curve around 15 pN: The 
highly stacked strand is nearing its maximum extension, 
whereas the unstacked strand is not. 

It is interesting to note that a mere 20% difference in 
stacking probability between poly(dA) and poly(dGdA) 
at zero force causes a significant difference in the aver- 
age length of stacked regions: (I) = 8 versus (I) = 2. 
This effect can be understood by considering a simple, 
uncooperative model for stacking along the strand. Let 
p be the probability that two neighbors are stacked and 
P(Z) the probability that a stacked cluster has length Z. 
Assuming an infinitely long polymer chain, the probabil- 
ity of having a continuously stacked region of length Z is 
P(Z) = (1 —p) p l , which is the probability of having Z sub- 
sequent base pairs stacked (each with probability p) and 
the (Z + l)-th base not stacked with the next base along 
the chain (which is with probability 1—p). The average 
length (Ouncoop °f a slacked region in this uncooperative 
model can thus be obtained by summing over Z: 

oo 

(O unC oo P = E^(o = I ^- (ii) 

1=0 p 

Since our model has low stacking cooperativity, 35 we can 
make the approximation p « (P s t)- Fig. [7^c) shows 
that this simple model compares remarkably well with 
the measured values of (Z). The fact that (Z) diverges as 
(P s t) approaches 1 explains the sensitivity of the model 
strands to relatively small changes in stacking propensity 
at large (P st ) and also explains the large differences in (Z) 
observed at zero force. 

It is illuminating to compare our results to the the- 
oretical model used by Seol et al. in Ref. 66 to explain 
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the observed force-extension curves of RNA. Their model 
makes similar physical assumptions to the behavior of our 
coarse-grained model: the single strand is split into rigid 
helical regions and flexible random coil regions. Thus the 
basic explanation for the plateau region is the same as in 
our model. However, there are also some differences. For 
example, our model suggests that absence of a plateau in 
the force extension curve does not necessarily mean the 
absence of stacking. In fact, we have observed that short 
stacked regions persist even while pulling the strand at 
a high force, because our model allows for three bases 
to remain stacked while aligning the backbone with the 
applied force, a feature that is not present in the model 
used in Ref. [66j Moreover, the concave region in the 
force-extension curve interpreted with our model would 
indicate the presence of a much stronger stacking propen- 
sity than the one derived in Ref. [66| Although our de- 
scription of single strands is fairly simple, it incorporates 
the underlying physics of the model of Ref. 66 and in ad- 
dition provides an explicit 3-dimensional representation 
of single- stranded nucleic acids. In summary, We believe 
that the presence of concave region in the force exten- 
sion curve suggests that long stacked regions are present 
in the relaxed strand. This would either indicate strong 
uncooperative stacking, as in our model, or large coop- 
erativity in stacking. 

E. The structure of a kissing complex 

In a recent publication, 37 we investigated DNA kiss- 
ing complexes, a system where topological and geometri- 
cal frustration have important effects, and studied the 
ability of the original average base model to describe 
these systems. In this section, we show how the se- 
quence dependence of interactions can introduce non- 
trivial changes to the structure of a kissing complex, with 
potential importance for the operation of nanotechnolog- 
ical systems EHUl 

A kissing complex is a system in which two hairpins 
have loop regions that are complementary and can thus 
at least partially hybridize (see Fig. |8ja)). They are a 
common motif in RNA and are expected to form in DNA 
nanotechnology systems where comple mentar y hairpins 
are used as fuel for DNA nanomachinesP^^ In the ex- 
perimental system realized in Ref. 73, two strands of 40 
nucleotides were designed to be both complementary and 
also able to form a hairpin with a stem of 10 base pairs. 
As the remaining 20-base loops are complementary to 
each other, the two hairpins can form a kissing complex. 
The sequences are^ 

3 ' -CGCAACGACG-GCTCCCCTCTTCTCATTTTA-CGTCGTTGCG-5 > 

and 

3 > -CGCAACGACG-TAAAATGAGAAGAGGGGAGC-CGTCGTTGCG-5 > 

where the hyphens separate stem and loop regions. A 
dilute solution of such strands tends to form hairpins 




Number of correct base pairs 



FIG. 8. Effects of sequence dependence on the structure 
of kissing hairpins, (a) Typical structure found in both the 
average and sequence-dependent parametrization, with 14 in- 
tramolecular base pairs, (b) Second free energy minimum 
found only in the sequence-dependent parametrization, with 
9 intramolecular base pairs. Please note the exposed bases — 
not present in (a) — that can be used as a toehold by the cat- 
alyst strand to initiate displacement, (c) Free energy profile 
for binding with the two parametrizations, with the sequence- 
dependent one exhibiting a second minimum corresponding to 
the structure depicted in (b). 



much more quickly than full duplexes, due to a lower ki- 
netic barrier for the former process. The hairpins in turn 
form kissing complexes, an intermediate met ast able state 
with respect to full hybridization that requires a signifi- 
cant amount of rearrangement to transform into the full 
duplex. The kinetic barrier, due to the topological frus- 
tration of the complex, is so high that full hybridization is 
almost impossible. However, this barrier can be reliably 
resolved by the introduction of a DNA catalyst strand, 
designed to open one of the hairpins by displacement and 
trigger full hybridization, thus releasing the stored free- 
energy. 11 ^ 

Following Ref. 73l we studied in Ref. [37lthe structure of 
the resulting kissing complex with the average sequence 
parametrization. We found that the system typically as- 
sumed a structure with two symmetric parallel helices, as 
shown in Fig. [8^ a). However, as the loop sequences used 
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in Ref.[73|are very asymmetric in G-C content, we expect 
that the average parametrization should overestimate the 
stability of the weakly bound region and conversely un- 
derestimate that of the strongly bound, G-C-rich region. 

When we repeated the structural study with the 
sequence-dependent potential, we obtained a qualita- 
tively different result. Computing the binding free- 
energy profile of the system, using the number of native 
base pairs (i.e. base pairs that would be present in the 
final full duplex) as a reaction coordinate (Fig.|8jc)), we 
found a second minimum at around nine interstrand base 
pairs that was not observed for the average parametriza- 
tion. A typical configuration associated with this min- 
imum is shown in Fig. (8^b). It is evident that as well 
as being able to form the structure with two symmet- 
ric helices, the system is also able to adopt an alterna- 
tive structure with a single intermolecular helix that both 
contains the G-C-rich section and is slightly larger than 
either individual helix in the two-helix form. 

This competing minimum has potentially important 
consequences for the nanotechnological applications of 
kissing hairpins. In Ref. [73j a catalyst strand was intro- 
duced to the system in order to facilitate full hybridiza- 
tion of the complex: the strand was designed to bind to 
the weaker half of one of the loops, and then to open up 
the hairpin by displacement. The fact that a compet- 
ing minimum exists in which the whole weaker half of 
the loop is available for binding will favor this process, 
as it provides a long, easily accessible toehold for dis- 
placement. Such toeholds are known 64 to accelerate dis- 
placement reactions by several orders of magnitude. Our 
model suggests that if the strand was instead designed 
to bind to the stronger half of the loop, its effectiveness 
would be hindered rather than helped by the presence 
of the alternative minimum. We would therefore expect 
such a catalyst to be less effective than the one used in 
Ref. [731 

The qualitative difference between the results of the 
two parametrizations in this case highlights that if one 
is interested in the detailed properties of a system like 
this one, where short binding regions with asymmetric 
G-C content are present, it is important to have a model 
with sequence-dependent binding strengths to be able to 
make more accurate predictions. Were the G-C pairs in 
the loop more evenly distributed, we would expect the 
results of the average parametrization free energy profile 
to accurately describe the kissing complex. 



VI. CONCLUSIONS 

We have extended the nucleotide-level coarse-grained 
DNA model of Ouldridge et a/P^ (which distinguishes 
between A-T and C-G base-pairing but otherwise treats 
these interactions at the average base level) to include 
sequence-dependent stacking and hydrogen-bonding in- 
teractions. To derive the new parameters, we developed 
a histogram reweighting procedure that allowed us to fit 



to thousands of melting temperatures of oligomers rang- 
ing in length from 6 to 18 base pairs. Melting tempera- 
tures were extracted from SantaLucia's nearest-neighbor 
modeP^ which we treat here as a good fit to experiment. 

Sequence can have an important effect on melting tem- 
peratures. For the same length oligomer, but different se- 
quences, melting temperatures can differ by as much as 
50 °C. Even for the same sequence content, but different 
base-pair ordering, variations in stacking energies mean 
that melting temperatures can vary by up to 10 °C. Our 
new parametrization reproduces these differences and on 
average agrees to within a standard deviation of 0.85 °C 
with the SL nearest-neighbor model. In contrast to the 
model's ability to capture thermodynamic properties, our 
coarse-grained model does not attempt to include the ef- 
fects of sequence on structural or mechanical properties 
of double-stranded DNA. Instead, these remain as pre- 
viously reported in Ref. [35j at least for sequences that 
are not extreme in G-C content so that sequence effects 
average out. 

Our new thermodynamic parametrization opens up 
the possibility of investigating sequence-dependent DNA 
phenomena. Specifically, we have considered here the 
following five systems: 

(a) Heterogeneous stacking transition in single- 
stranded DNA: Even though our stacking parameters do 
not vary by more than 7%, they can induce significant 
spatial and temporal heterogeneity in the stacking of sin- 
gle strands. For example the difference in stacking prob- 
ability between the strongest and the weakest stacking 
pairs in the oligomer we studied is large enough that 
the midpoints of the stacking transition of two separate 
pairs in a single strand can be separated by as much as 
40 °C. These results suggest that structural and mechan- 
ical properties of single-stranded DNA should be highly 
heterogeneous as well. 

(b) The hybridization free energy profiles of duplexes: 
We studied three different 12mer sequences at their re- 
spective melting temperatures, finding that sequence het- 
erogeneity also has significant effects on the probability 
that the ends of a duplex are open, i.e. that they fray. 
We found that A-T ends are typically frayed, while se- 
quences with G-C ends exhibit a free-energy minimum 
for a completely closed duplex. 

(c) The effect of stacking strength in the loop on hair- 
pin stability: The SL model only distinguishes base-pair 
steps. Given that we used this model to generate the 
melting temperatures to which we fit, we were unable to 
uniquely isolate the stacking strength of individual base 
combinations. Additional experimental data on single- 
stranded stacking is needed to separate these interac- 
tions. One potential source of data that goes beyond the 
SL model is given by experiments on melting of hairpins 
with poly(dA) and poly(dT) loopsP^By calculating how 
increasing the stacking strength in the loop lowers the 
melting temperatures, we showed that parameters could 
be derived that reproduce the expected stronger AA com- 
pared to TT stacking, without significantly changing the 
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quality of our fit to the overall melting temperatures of 
duplexes. Nevertheless, we do not yet include this dif- 
ference in our parametrization, because to be consistent 
we would need similar data to distinguish between other 
base-pair steps. 

(d) The force- extension properties of single strands: 
Another experimental situation where differences in 
single-stranded stacking have been measured experimen- 
tally is in the force extension of ssDNA. We show that 
more strongly stacked sequences should be more exten- 
sible for small forces up to about 5 pN. For certain se- 
quences, experiments have observed a concave "plateau" 
region in the force-extension curves. We are able to qual- 
itatively reproduce this feature and, in agreement with 
previous explanations, 66 attribute the plateau region to 
the different force response of stiffer stacked and more 
flexible unstacked regions. Furthermore, we show that 
the onset of the plateau region is correlated with a sharp 
decrease in the average length of stacked regions with 
increasing force. Because the average length of stacked 
regions drops rapidly with a relatively small decrease in 
the average stacking, we argue that a very large propen- 
sity to stack (> 90%) is necessary to give a similar results 
to those observed in experiment. We therefore conclude 
that if these phenomena are to be explained through 
largely uncooperative stacking of bases to form helical 
ssDNA, as in our model, a high stacking propensity is 
required. Furthermore, failure to observe a force plateau 
for a sequence does not imply an absence of stacking. 

(e) The structure of a kissing-loop complex: Finally, 
we applied our model to study the effect of sequence on 
the structure of a kissing complex formed by two hair- 
pins. When the sequences used in the experiments of 
Ref. [73] are studied, the average base model exhibits one 
minimum free-energy structure)^ while our sequence- 
dependent model also generates a second, qualitatively 
distinct, stable structure. The new structure completely 
exposes a toehold which may significantly accelerate the 
DNA catalyst mediated release of free energy stored in 
the kissing complex. 

The examples described above suggest that our model 
can be used for many other DNA applications in nan- 
otechnology and biology where sequence plays a signifi- 
cant role. Our model should work particularly well for 
situations where single-to-double stranded transitions are 
important. Nevertheless, users of our model should re- 
main aware of some limitations. Firstly, the model is only 
fit to a single salt concentration of [Na + ] = 0.5M, where 
the electrostatic properties are strongly screened. A new 
kind of parametrization may be necessary to reach sig- 
nificantly lower salt concentrations. Secondly, the model 
lacks certain detailed local structural information, such 
as major and minor grooving, or sequence dependent 
elastic parameters. Furthermore, our model was fit to 
data that only includes the effects of base-pair steps. Ad- 
ditional experimental data on single-stranded stacking is 
needed to separate out the stacking strength of individ- 
ual base combinations. Applications where the effects we 



neglect are crucial may therefore be best studied by other 
models. 

We are developing further improvements to the model, 
but our work also highlights the need for new systematic 
experiments, in particular to elucidate the basic physics 
of single- stranded stacking interactions. Such informa- 
tion would be also of great help to those studying DNA- 
coated colloids P2 

To summarize, we have introduced a new coarse- 
grained model of DNA that has been parametrized 
to reproduce the thermodynamic effects of sequence- 
dependent interactions. The current version of the model 
provides a computationally efficient and physically accu- 
rate tool for the study of problems ranging from DNA 
nanotechnology to biology. To facilitate its use, we have 
made simulation code implementing Monte Carlo and 
Brownian dynamics for the model available as free soft- 
ware called oxDNA at http://dna.physics.ox.ac.uk, 



VII. ACKNOWLEDGMENTS 

The authors would like to thank Erik Winfree, Filip 
Lankas, Felix Ritort and Agnes Noy for helpful discus- 
sions. The authors also acknowledge financial support 
from the Engineering and Physical Sciences Research 
Council, University College (Oxford), and from the Ox- 
ford Supercomputing Centre for computer time. P. S. is 
grateful for the award of a Scatcherd European Scholar- 
ship. 

1 W. Saenger, Principles of Nucleic Acid Structure, Springer- 
Verlag, New York, 1984. 
2 L. Adleman, Science 266, 1021 (1994). 

3 J. Bath, S. J. Green, and A. J. Turberfield, Angew. Chem. Int. 

Ed. 117, 4432 (2005). 
4 J. Bath, S. J. Green, K. E. Allan, and A. J. Turberfield, Small 

5, 1513 (2009). 

5 E. Winfree, F. R. Liu, L. A. Wenzler, and N. C. Seeman, Nature 

394, 539 (1998). 
6 P. W. K. Rothemund, Nature 440, 297 (2006). 
7 J. Sponer, K. E. Riley, and P. Hobza, Phys. Chem. Chem. Phys. 

10, (2008). 

8 A. Perez, A. Noy, F. Lankas, F. J. Luque, and M. Orozco, Nucleic 

Acids Res. 32, 6144 (2004). 
9 P. Hobza and J. Sponer, Chem. Rev. 99, 3247 (1999). 
10 J. Sponer et al., Chem. Eur. J. 12, 2854 (2006). 
n D. Svozil, P. Hobza, and J. Sponer, J. Phys. Chem. B 114, 1191 
(2010). 

12 J. Sponer, P. Jurecka, and P. Hobza, J. Am. Chem. Soc. 126, 
10142 (2004). 

13 W. D. Cornell et al., J. Am. Chem. Soc. 117, 5179 (1995). 
14 B. R. Brooks et al., J. Comput. Chem. 4, 187 (1983). 
15 A. Perez, F. J. Luque, and M. Orozco, Acc. Chem. Res. 45, 196 
(2012). 

16 E. J. Sambriski, D. C. Schwartz, and J. J. de Pablo, Biophys. J. 
96, 1675 (2009). 

17 J. C. Araque, A. Z. Panagiotopoulos, and M. A. Robert, J. Chem. 

Phys. 134, 165103 (2011). 
18 M. C. Linak, R. Tourdot, and K. D. Dorfman, J. Chem. Phys. 

135, 205102 (2011). 
19 K. Drukker, G. Wu, and G. C. Schatz, J. Chem. Phys. 114, 579 

(2001). 



15 



20 M. Sales-Pardo, R. Guimera, A. A. Moreira, J. Widom, and 

L. Amaral, Phys. Rev. E 71, 051902 (2005). 
21 M. Kenward and K. D. Dorfman, J. Chem. Phys. 130, 095101 

(2009). 

22 T. E. Ouldridge, I. G. Johnston, A. A. Louis, and J. P. K. Doye, 

J. Chem. Phys. 130, 065101 (2009). 
23 T. A. Knotts, IV, N. Rathore, D. Schwartz, and J. J. de Pablo, 

J. Chem. Phys. 126 (2007). 
24 A.-M. Florescu and M. Joyeux, J. Chem. Phys. 135, 085105 

(2011). 

25 A. Morriss-Andrews, J. Rottler, and S. S. Plotkin, J. Chem. 
Phys. 132, 035105 (2010). 

26 A. V. Savin, M. A. Mazo, I. P. Kikot, L. I. Manevitch, and A. V. 
Onufriev, Phys. Rev. B 83, 245406 (2011). 

27 P. D. Dans, A. Zeida, M. R. Machado, and S. Pantano, J. Chem. 
Theory Comput. 6, 1711 (2010). 

28 A. Savelyev and G. A. Papoian, Biophys. J. 96, 4044 (2009). 

29 N. B. Becker and R. Everaers, Phys. Rev. E 76, 021923 (2007). 

30 F. Lankas, Innovations in Biomolecular Modeling and Simu- 
lations, volume 2 of RSC Biomolecular Sciences, The Royal 
Society of Chemistry, 2012. 

31 T. Dauxois, M. Peyrard, and A. R. Bishop, Phys. Rev. E 47, 
684 (1993). 

32 C. Nisoli and A. R. Bishop, Phys. Rev. Lett. 107, 068102 (2011). 
33 S. Cocco and R. Monasson, Phys. Rev. Lett. 83, 5178 (1999). 
34 T. E. Ouldridge, A. A. Louis, and J. P. K. Doye, Phys. Rev. 

Lett. 104, 178101 (2010). 
35 T. E. Ouldridge, A. A. Louis, and J. P. K. Doye, J. Chem. Phys. 

134, 085101 (2011). 
36 T. E. Ouldridge, Coarse-grained modelling of DNA and DNA 

nanotechnology, PhD thesis, University of Oxford, 2011. The 

thesis is available at http://tinyurl.com/7ycbx7c 
37 F. Romano, A. Hudson, J. P. K. Doye, T. E. Ouldridge, and 

A. A. Louis, J. Chem. Phys. 136, 215102 (2012). 
38 C. De Michele, L. Rovigatti, T. Bellini, and F. Sciortino, Soft 

Matter , in press (2012). 
39 C. Ma tek, T. E. Ouldri dge, A. Levy, J. P. K. Doye, and A. A. 

Louis, |arXiv:1 206.2636vl (2012). 
40 C. Calladine and H. Drew, Understanding DNA: the molecule & 

how it works, Academic Press, 1997. 
41 W. K. Olson, A. A. Gorin, X.-J. Lu, L. M. Hock, and V. B. 

Zhurkin, Proc. Natl. Acad. Sci. USA 95, 11163 (1998). 
42 S. Geggier and A. Vologodskii, Proc. Natl. Acad. Sci. USA 107, 

15421 (2010). 

43 B. Basham, G. P. Schroth, and P. S. Ho, Proc. Natl. Acad. Sci. 
USA 92, 6464 (1995). 

44 We use poly-dC,dA,dT,and dG notation for DNA sequences with 
repeated nucleotide content to distinguish them from RNA se- 
quences, which are referred to with rC, rA, rU and rG. 

45 B. Alberts et al., Molecular Biology of the Cell, fourth edition, 
Garland Science, 2002. 

46 V. Ortiz and J. J. de Pablo, Phys. Rev. Lett. 106, 238107 (2011). 

47 J. SantaLucia, Jr., Proc. Natl. Acad. Sci. U.S.A 17, 1460 (1998). 

48 J. SantaLucia, Jr. and D. Hicks, Annu. Rev. Biophys. Biomol. 
Struct. 33, 415 (2004). 

49 J. N. Zadeh et al., J. Comput. Chem. 32, 170 (2011). 



50 M. Markham, Nicholas R.and Zuker, Methods. Mol. Bio. 453, 3 

(2008) . 

51 N. R. Markham and M. Zuker, Nucleic Acids Res. 33, W577 
(2005). 

52 N. L. Novere, Bioinformatics 17, 1226 (2001). 
53 J. M. Huguet et al., Proc. Natl. Acad. Sci. USA 107, 15431 
(2010). 

54 T. E. Ouldridge, A. A. Louis, and J. P. K. Doye, J. Phys.: 

Condens. Matter 22, 104102 (2010). 
55 A. M. Ferrenberg and R. H. Swendsen, Phys. Rev. Lett. 61, 2635 

(1988). 

56 D. Landau and K. Binder, A Guide to Monte Carlo Simulations 

in Statistical Physics, Cambridge University Press, New York, 

NY, USA, 2005. 
57 G. Torrie and J. P. Valleau, J. Comp. Phys. 23, 187 (1977). 
58 W.-S. Chen et al., Phys. Rev. Lett. 105, 218104 (2010). 
59 S. Whitelam, E. H. Feng, M. F. Hagan, and P. L. Geissler, Soft 

Matter 5, 1521 (2009). 
60 D. Frenkel and B. Smit, Understanding Molecular Simulation: 

From Algorithms to Applications, Academic Press, Inc., Orlando, 

FL, USA, 1st edition, 1996. 
61 J. Russo, P. Tartaglia, and F. Sciortino, J. Chem. Phys. 131, 

014504 (2009). 

62 J. Holbrook, M. Capp, R. Saecker, and M. Record, Biochemistry 
38, 8409 (1999). 

63 S. Nonin, J.-L. Leroy, and M. Gueron, Biochemistry 34, 10652 
(1995). 

64 D. Y. Zhang and E. Winfree, J. Am. Chem. Soc. 131, 17303 

(2009) . 

65 N. L. Goddard, G. Bonnet, O. Krichevsky, and A. Libchaber, 

Phys. Rev. Lett. 85, 2400 (2000). 
66 Y. Seol, G. M. Skinner, K. Visscher, A. Buhot, and A. Halperin, 

Phys. Rev. Lett. 98, 158103 (2007). 
67 Y. Seol, G. M. Skinner, and K. Visscher, Phys. Rev. Lett. 93, 

118102 (2004). 

68 G. Mishra, D. Giri, and S. Kumar, Phys. Rev. E 79, 031930 
(2009). 

69 M.-N. Dessinges et al., Phys. Rev. Lett. 89, 248102 (2002). 
70 S. B. Smith, Y. Cui, and C. Bustamante, Science 271, 795 (1996). 
71 Y. Zhang, H. Zhou, and Z.-C. Ou-Yang, Biophys. J. 81, 1133 
(2001). 

72 A. Montanari and M. Mezard, Phys. Rev. Lett. 86, 2178 (2001). 
73 J. Bois et al., Nucleic Acids Res. 33, 4090 (2005). 
74 R. M. Dirks and N. A. Pierce, Proc. Natl. Acad. Sci. USA 101, 
15275 (2004). 

75 S. Venkataraman, R. M. Dirks, P. W. K. Rothemund, E. Winfree, 
and N. A. Pierce, Nat. Nanotechnol. 2, 490 (2007). 

76 S. J. Green, J. Bath, and A. J. Turberfield, Phys. Rev. Lett. 
101, 238101 (2008). 

77 P. Yin, H. M. Choi, C. R. Calvert, and N. A. Pierce, Nature 
451, 318 (2008). 

78 R. A. Muscat, J. Bath, and A. J. Turberfield, Nano Lett. 11, 
982 (2011). 

79 S. J. Green, D. Lubrich, and A. J. Turberfield, Biophys. J. 91, 
2966 (2006). 

80 B. M. Mladek, J. Fornleitner, F. J. Martinez- Veracoechea, 
A. Dawid, and D. Frenkel, Phys. Rev. Lett. 108, 268301 (2012). 



